Working with healthcare data introduces preprocessing challenges that go beyond those you might encounter with structured data. Some familiar techniques still apply, while others look very different once your data becomes medical images.

In this article, you’ll learn how to prepare a real-world medical imaging dataset for machine learning, from initial data validation to a complete preprocessing pipeline.

We’ll use the Chest X-Ray Pneumonia dataset as our running example, but the lessons apply broadly to healthcare imaging data, including ultrasound, MRI, CT, and dermatology images.

What You'll Learn in This Article

By the end of this article, you'll know how to:

  • Approach healthcare data preprocessing differently from preprocessing structured data, and recognize where standard techniques fall short

  • Validate a medical imaging dataset before training to catch corrupted files, mislabels, and data leakage between train and test

  • Apply six core preprocessing techniques for medical images

  • Build a complete preprocessing pipeline for chest X-rays using Python with OpenCV

What We'll Cover:

  • Why Preprocessing Data Matters More in Healthcare

  • The Dataset

  • Before Preprocessing: Validate the Dataset

  • The Six Pillars of Healthcare Imaging Preprocessing

  • Pillar 1: Scaling — Making the Numbers Play Fair

  • Pillar 2: Normalization — Centering the Data

  • Pillar 3: Guiding the Model's Attention

  • Pillar 4: Handling Missing Data

  • Pillar 5: Resizing & Resampling — Fitting Everything in the Same Frame

  • Pillar 6: Denoising & Artifact Handling — Cleaning the Window

  • Putting it All Together: A Complete Pipeline

  • Try it Yourself

  • Conclusion

Why Preprocessing Data Matters More in Healthcare

Imagine handing a toddler a jigsaw puzzle with missing pieces, warped edges, and pieces from three different puzzles mixed together. The toddler can't solve it, but that isn't really the toddler's fault.

The same thing happens when raw, messy data gets fed into a machine learning model. A bad prediction on a clinical image can mean a missed diagnosis.

Illustration showing a healthcare data preprocessing workflow. Mixed medical images with different sizes, missing labels, noisy scans, and corrupted files enter a preprocessing pipeline and emerge as clean, standardized, model-ready images ready for machine learning.

Healthcare data tends to be messier than what most ML practitioners are used to:

  • Images come from different machines, hospitals, and acquisition protocols

  • Labels are inconsistent, sometimes missing, sometimes wrong

  • Patient data is incomplete

  • Image sizes, contrast levels, and orientations vary across sources

Poor preprocessing often leads to models that perform well on benchmark datasets but struggle to generalize to data collected from different hospitals or imaging devices.

The Dataset

This guide uses the Chest X-Ray Pneumonia dataset by Paul Mooney on Kaggle. It's a strong choice for learning preprocessing because:

  • It contains around 5,800 pediatric chest X-rays

  • It has two clear classes — Normal and Pneumonia

  • It's already organized into train, validation, and test folders

  • The images are recognizable without specialized medical training

  • It exhibits almost every preprocessing challenge worth learning

The dataset is available at Kaggle: Chest X-Ray Pneumonia.

Folder Structure

After downloading, the dataset is organized like this:

    chest_xray/
    ├── train/
    │   ├── NORMAL/
    │   └── PNEUMONIA/
    ├── val/
    │   ├── NORMAL/
    │   └── PNEUMONIA/
    └── test/
        ├── NORMAL/
        └── PNEUMONIA/
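Before opening any image, it can help to confirm that the folders on disk match this layout and to see how many files each class holds. The sketch below assumes the split/class layout shown above; the helper name `count_images` and the extension list are illustrative choices, not part of the dataset.

```python
import os

# Assumed image extensions; adjust if your copy of the dataset differs.
IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".png")


def count_images(data_dir):
    """Return {split: {class_name: image_count}} for a split/class folder layout."""
    counts = {}
    for split in sorted(os.listdir(data_dir)):
        split_path = os.path.join(data_dir, split)
        if not os.path.isdir(split_path):
            continue  # skip stray files next to the split folders
        counts[split] = {}
        for class_name in sorted(os.listdir(split_path)):
            class_path = os.path.join(split_path, class_name)
            if not os.path.isdir(class_path):
                continue
            counts[split][class_name] = sum(
                1 for f in os.listdir(class_path)
                if f.lower().endswith(IMAGE_EXTENSIONS)
            )
    return counts


# Only runs when the dataset folder is actually present.
if os.path.isdir("chest_xray"):
    for split, per_class in count_images("chest_xray").items():
        print(split, per_class)
```

A large gap between the class counts is worth noting early, since it affects how you read accuracy numbers later.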

Side-by-side comparison — Normal vs Pneumonia chest X-ray:

Side-by-side chest X-ray images showing a normal lung scan on the left and a pneumonia scan on the right. The pneumonia image contains visible cloudy opacities compared with the clearer lung fields in the normal image.

A quick first look at one of the images:

    import os
    import numpy as np
    import matplotlib.pyplot as plt
    from PIL import Image
    import cv2

    DATA_DIR = "chest_xray"
    TRAIN_DIR = os.path.join(DATA_DIR, "train")

    # Peek at a sample image
    normal_dir = os.path.join(TRAIN_DIR, "NORMAL")
    sample_path = os.path.join(normal_dir, os.listdir(normal_dir)[0])
    sample_image = cv2.imread(sample_path, cv2.IMREAD_GRAYSCALE)

    print(f"Image shape: {sample_image.shape}")
    print(f"Pixel range: {sample_image.min()} to {sample_image.max()}")
    print(f"Data type: {sample_image.dtype}")

The output reveals a few useful things right away: most images are large (often around 1500×2000 pixels), pixel values fall in the 0–255 range, and image sizes vary across the dataset. Each of these observations will inform a preprocessing step.

Before Preprocessing: Validate the Dataset

Before applying any transformations, it's worth checking that the data itself is intact. This step alone catches issues that would otherwise cause training to fail silently or produce misleading results.

A simple validation function:

    def validate_dataset(data_dir):
        """Scan a dataset folder and flag common data quality issues."""
        corrupted = []
        too_small = []
        nearly_black = []
        total = 0

        for class_name in os.listdir(data_dir):
            class_path = os.path.join(data_dir, class_name)
            if not os.path.isdir(class_path):
                continue
            for fname in os.listdir(class_path):
                fpath = os.path.join(class_path, fname)
                total += 1
                try:
                    img = cv2.imread(fpath, cv2.IMREAD_GRAYSCALE)
                    if img is None:
                        corrupted.append(fpath)
                        continue
                    if img.shape[0] < 100 or img.shape[1] < 100:
                        too_small.append(fpath)
                    if img.mean() < 5:
                        nearly_black.append(fpath)
                except Exception:
                    corrupted.append(fpath)

        print(f"Total files scanned: {total}")
        print(f"Corrupted: {len(corrupted)}")
        print(f"Too small: {len(too_small)}")
        print(f"Nearly black: {len(nearly_black)}")
        return corrupted, too_small, nearly_black

    validate_dataset(TRAIN_DIR)

Common issues to check for at this stage (the function above flags the first three; duplicates and mislabels need separate checks):

  • Corrupted files: files that won't open at all

  • Empty or nearly-black images: failed acquisitions or saved-as-blank files

  • Wrong dimensions: thumbnails or partial downloads mixed in

  • Duplicate images: the same scan appearing in both train and test (this causes data leakage)

  • Mislabeled images: a normal X-ray placed in the pneumonia folder
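The duplicate check in the list above is not covered by a shape or brightness test, so here is one possible sketch. It hashes file bytes with SHA-256 and reports files whose contents appear in both folders. The function names are illustrative, and the approach only finds byte-identical copies: a re-encoded or resized copy of the same scan would need a perceptual comparison instead.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's bytes, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def index_by_digest(root_dir):
    """Map digest -> list of file paths found anywhere under root_dir."""
    index = defaultdict(list)
    for dirpath, _, filenames in os.walk(root_dir):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            index[file_digest(path)].append(path)
    return index


def find_cross_split_duplicates(dir_a, dir_b):
    """Return (paths_in_a, paths_in_b) pairs whose file contents are identical."""
    index_a = index_by_digest(dir_a)
    index_b = index_by_digest(dir_b)
    shared = sorted(set(index_a) & set(index_b))
    return [(sorted(index_a[d]), sorted(index_b[d])) for d in shared]
```

Calling `find_cross_split_duplicates` on the train and test folders returns an empty list when no exact copies are shared between them.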

⚠️ This step is critical. One corrupted file can crash a training loop hours into a run. Duplicates shared between train and test can inflate accuracy scores without anyone noticing.

The Six Pillars of Healthcare Imaging Preprocessing

Preprocessing for medical images can be organized around six core concerns. Two of them carry over directly from preprocessing structured data. Two need to be adapted because the mechanics change when the input is an image. And two are entirely new: they only exist once the data becomes pictures of human bodies.

Pillar 1: Scaling — Making the Numbers Play Fair

Imagine two children comparing their collections. One has 3 seashells. The other has 3,000 stickers. Asking who has more makes the answer seem obvious, but the scales are completely different. Comparing them meaningfully means putting both collections on the same measuring system.

In medical images, pixels usually range from 0 to 255 in 8-bit images, or 0 to 65,535 in some 16-bit medical DICOM images. Neural networks tend to train faster and more reliably when input values are small numbers close to zero.

Histogram comparison showing chest X-ray pixel values before and after scaling. The left histogram displays values in the 0–255 range, while the right histogram shows the same distribution scaled to the 0–1 range used for machine learning.

The fix: Divide every pixel by the maximum possible value for its bit depth, bringing everything into the 0-to-1 range.

    image = cv2.imread(sample_path, cv2.IMREAD_GRAYSCALE)

    # Scale to [0, 1]
    image_scaled = image.astype(np.float32) / 255.0

    print(f"Before scaling: {image.min()} to {image.max()}")
    print(f"After scaling:  {image_scaled.min():.3f} to {image_scaled.max():.3f}")

Takeaway: Pixel scaling follows the same principle as scaling any numerical feature. The values simply happen to be arranged as an image rather than a column.
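Dividing by 255 only fits 8-bit images. For the 16-bit case mentioned earlier, one option is to derive the divisor from the array's dtype. The sketch below assumes unsigned integer input, and `scale_to_unit` is an illustrative name. Some 16-bit scans use only part of the available range, in which case the bit depth actually stored may be a better divisor than the dtype maximum.

```python
import numpy as np


def scale_to_unit(image):
    """Scale an unsigned-integer image to float32 in [0, 1] using its dtype maximum."""
    if not np.issubdtype(image.dtype, np.unsignedinteger):
        raise TypeError(f"expected an unsigned integer image, got {image.dtype}")
    max_value = np.iinfo(image.dtype).max  # 255 for uint8, 65535 for uint16
    return image.astype(np.float32) / max_value


# Tiny synthetic examples standing in for real scans
img8 = np.array([[0, 128, 255]], dtype=np.uint8)
img16 = np.array([[0, 32768, 65535]], dtype=np.uint16)

print(scale_to_unit(img8))
print(scale_to_unit(img16))
```

Both outputs span 0 to 1 even though the raw ranges differ by a factor of 257.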

Pillar 2: Normalization — Centering the Data

Imagine a teacher asks a class to rate a movie from 1 to 10. One child always gives 9s and 10s. Another spreads ratings evenly from 1 to 10. Comparing their opinions fairly requires adjusting each child's score relative to their own average.

In medical imaging, even after scaling to 0–1, the overall brightness of images can vary. Some X-rays are taken with stronger exposure than others. Normalization shifts and rescales pixel values so that, across the training set (or per image or per channel, depending on the variant), they are centered around zero with a standard deviation of one.

The fix: Subtract the mean, divide by the standard deviation.

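    # Compute mean and std from the TRAINING set only, never from validation or test
    def compute_train_stats(train_dir, sample_limit=1000):
        """Compute pixel mean and std over a sample of the training set."""
        class_dirs = [
            os.path.join(train_dir, d) for d in sorted(os.listdir(train_dir))
            if os.path.isdir(os.path.join(train_dir, d))
        ]
        # Split the sample budget across classes so one class cannot use it all up
        per_class = max(1, sample_limit // max(1, len(class_dirs)))

        # Running sums avoid holding every pixel of every image in memory at once
        total, total_sq, n = 0.0, 0.0, 0
        for class_path in class_dirs:
            for fname in sorted(os.listdir(class_path))[:per_class]:
                img = cv2.imread(os.path.join(class_path, fname), cv2.IMREAD_GRAYSCALE)
                if img is None:
                    continue
                px = img.astype(np.float64) / 255.0
                total += px.sum()
                total_sq += (px ** 2).sum()
                n += px.size

        if n == 0:
            raise ValueError(f"No readable images found in {train_dir}")
        mean = total / n
        std = np.sqrt(total_sq / n - mean ** 2)
        return mean, std

    train_mean, train_std = compute_train_stats(TRAIN_DIR)
    image_normalized = (image_scaled - train_mean) / train_std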

⚠️ Avoid this common mistake: Statistics for normalization should be computed from the training set only, never from validation or test. Including those in the calculation leaks information from the evaluation data into the model. The same statistics should then be applied to validation, test, and any new data at inference time.
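To make the fit-on-train, apply-everywhere rule concrete, here is a small sketch using synthetic arrays in place of real scans. The array shapes, value ranges, and the `normalize` helper are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for scaled images (values already in [0, 1]).
# The "test" split is deliberately a little brighter than the "train" split.
train_images = rng.uniform(0.2, 0.8, size=(50, 32, 32)).astype(np.float32)
test_images = rng.uniform(0.3, 0.9, size=(10, 32, 32)).astype(np.float32)

# Fit: statistics come from the training split only
train_mean = float(train_images.mean())
train_std = float(train_images.std())


def normalize(images, mean, std):
    """Apply z-score normalization with externally supplied statistics."""
    return (images - mean) / std


# Apply: the same two numbers are reused for every split
train_norm = normalize(train_images, train_mean, train_std)
test_norm = normalize(test_images, train_mean, train_std)

print(f"train: mean={train_norm.mean():.3f}, std={train_norm.std():.3f}")
print(f"test:  mean={test_norm.mean():.3f}, std={test_norm.std():.3f}")
```

The test split will not come out at exactly zero mean and unit variance, and that is expected. Recomputing statistics on the test split to force it there is the leakage the warning above describes.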

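Takeaway: Centering and scaling each image with the dataset's statistics is the imaging equivalent of standardizing a feature column. Pixel values now sit on a common, zero-centered scale across the dataset. Note that dataset-level statistics do not remove exposure differences between individual scans; per-image normalization or contrast enhancement (Pillar 3) addresses those.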

Pillar 3: Guiding the Model's Attention

Imagine a child walking into a crowded pet store. Instead of describing every animal in sight, a parent points to the features that matter: “Look at the soft fur, the fluffy tail, and the nice small size.” The child learns where to focus their attention.

Medical image preprocessing does something similar. It highlights the regions and features most relevant to the diagnostic task.

  • Region-of-interest (ROI) cropping: focus on the lung field and discard the patient's arms, machine borders, and any imprinted text

  • Contrast enhancement: use techniques like CLAHE (Contrast Limited Adaptive Histogram Equalization) to make subtle lung textures more visible

  • Channel selection: for images stored as RGB but containing grayscale information, convert to single-channel input to remove redundant channels
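For the channel selection item, a cautious approach is to collapse channels only when they are verifiably identical. The sketch below works on NumPy arrays shaped like the output of an image reader; `to_single_channel` is an illustrative helper, not a library function. When channels differ slightly, an explicit color-to-gray conversion such as OpenCV's `cv2.cvtColor` is the usual alternative.

```python
import numpy as np


def to_single_channel(image):
    """Collapse a 3-channel image to one channel when all channels are identical."""
    if image.ndim == 2:
        return image  # already single-channel
    if image.ndim == 3 and image.shape[2] == 3:
        first = image[..., 0]
        same = (np.array_equal(first, image[..., 1])
                and np.array_equal(first, image[..., 2]))
        if same:
            return first.copy()
        raise ValueError("channels differ; image carries real color information")
    raise ValueError(f"unsupported image shape: {image.shape}")
```

Raising an error on mismatched channels is a deliberate choice here: it surfaces unexpected color images instead of silently discarding information.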

Three-panel illustration showing a chest X-ray before and after feature enhancement. The first panel shows the original image, the second highlights the lung region of interest, and the third shows the image after CLAHE contrast enhancement with lung textures appearing more visible.

CLAHE applied to an X-ray:

    # CLAHE enhances local contrast, which is useful for X-rays
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    image_enhanced = clahe.apply(image)

    # Visualize the difference
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    axes[0].imshow(image, cmap='gray')
    axes[0].set_title('Original')
    axes[1].imshow(image_enhanced, cmap='gray')
    axes[1].set_title('After CLAHE')
    plt.show()

Takeaway: The goal of teaching the model what to look at hasn't changed. With structured data, the answer is in new columns. With images, the answer is in cropping, enhancement, and emphasizing the regions that carry diagnostic signal.
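ROI cropping is described in this pillar but not shown in code. A true lung-field crop usually needs a segmentation model, which is beyond this article. As a much cruder stand-in, the sketch below trims empty dark borders by cropping to the bounding box of pixels brighter than a threshold. The function name, the threshold of 10, and the margin parameter are all assumptions for illustration.

```python
import numpy as np


def crop_to_content(image, threshold=10, margin=0):
    """Crop a 2-D grayscale image to the bounding box of pixels above threshold."""
    mask = image > threshold
    if not mask.any():
        return image  # nothing above threshold; leave the image untouched
    rows = np.flatnonzero(mask.any(axis=1))  # row indices containing content
    cols = np.flatnonzero(mask.any(axis=0))  # column indices containing content
    top = max(int(rows[0]) - margin, 0)
    bottom = min(int(rows[-1]) + 1 + margin, image.shape[0])
    left = max(int(cols[0]) - margin, 0)
    right = min(int(cols[-1]) + 1 + margin, image.shape[1])
    return image[top:bottom, left:right]
```

This removes black frames around a scan but will not remove arms or burned-in text, since those are brighter than the threshold.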

Pillar 4: Handling Missing Data

Imagine reading a storybook with a few damaged pages. You don’t throw away the entire book. Instead, you decide whether to skip the page, infer what might be missing, or mark it for review.

In medical imaging, missing data can mean corrupted files, missing labels, or incomplete studies rather than empty spreadsheet cells.

The same three strategies — drop, impute, flag — still apply, just with different mechanics:

    # Strategy 1: Drop. Remove unreadable or empty images.
    def is_valid_image(path):
        try:
            img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
            if img is None:
                return False
            if img.mean() < 5:  # nearly black
                return False
            if img.shape[0] < 50 or img.shape[1] < 50:  # too small
                return False
            return True
        except Exception:
            return False

    # Strategy 2: Impute. Rare for images, but possible (e.g., inpainting to fill
    #   in missing patches). Generally avoided for diagnostic data.

    # Strategy 3: Flag. Track which patients are missing which modalities,
    #   and let the model condition on availability. Common in multi-modal healthcare ML.

Takeaway: "Missing" in imaging data is rarely just a NaN. It can be a broken file, an unlabeled scan, an absent modality, or a black corner inside an image. The same three strategies still apply.
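The drop and flag strategies can be combined into a simple triage step that runs before training. The sketch below is one possible shape for it. The `ScanRecord` fields and the `triage` function are hypothetical, and the `readable` flag is assumed to come from a check such as `is_valid_image`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanRecord:
    """Hypothetical manifest entry for one image file."""
    path: str
    label: Optional[str]  # None when the label is missing
    readable: bool        # e.g. the result of a file validity check


def triage(records):
    """Split records into (keep, dropped, needs_review) lists."""
    keep, dropped, needs_review = [], [], []
    for rec in records:
        if not rec.readable:
            dropped.append(rec)        # Drop: the file cannot be used at all
        elif rec.label is None:
            needs_review.append(rec)   # Flag: usable image, label must be confirmed
        else:
            keep.append(rec)
    return keep, dropped, needs_review
```

Keeping the dropped and flagged lists, rather than discarding them, leaves a record of how much data was lost and why.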

Pillar 5: Resizing & Resampling — Fitting Everything in the Same Frame

Imagine displaying children’s drawings on a classroom wall. If every drawing is a different size, they won’t fit neatly into the display. You resize them while preserving their proportions.

Medical images must often be resized to a common input size, but anatomical structures should retain their original shape.

Comparison of two chest X-ray resizing approaches. One image is stretched into a square shape, distorting the lungs, while the second preserves the original aspect ratio by adding padding around the image. The aspect-ratio-preserving approach is highlighted as the preferred method.

The fix: Resize all images to a common shape. For medical data, how the resizing is done matters.

    TARGET_SIZE = (224, 224)

    # Simple resize (may distort aspect ratio)
    image_resized = cv2.resize(image, TARGET_SIZE)

    # Better: preserve aspect ratio with padding
    def resize_with_padding(image, target_size):
        h, w = image.shape[:2]
        target_h, target_w = target_size
        scale = min(target_h / h, target_w / w)
        # round() avoids losing a pixel to floating-point error; keep sizes in [1, target]
        new_h = max(1, min(target_h, round(h * scale)))
        new_w = max(1, min(target_w, round(w * scale)))
        resized = cv2.resize(image, (new_w, new_h))  # cv2.resize expects (width, height)

        pad_h = target_h - new_h
        pad_w = target_w - new_w
        top, bottom = pad_h // 2, pad_h - pad_h // 2
        left, right = pad_w // 2, pad_w - pad_w // 2
        padded = cv2.copyMakeBorder(resized, top, bottom, left, right,
                                    cv2.BORDER_CONSTANT, value=0)
        return padded

    image_clean_resize = resize_with_padding(image, TARGET_SIZE)

⚠️ Why aspect ratio matters in healthcare: Squishing a chest X-ray horizontally makes the lungs look unnatural. Models trained on distorted anatomy often perform worse on real scans. Preserving aspect ratio is generally the safer choice.

Takeaway: Models need a consistent input size, but the geometry of the anatomy needs to be preserved. Resize, but resize carefully.
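To see what aspect-preserving resizing does to the numbers, it helps to work through the arithmetic without touching any pixels. The sketch below computes the resized shape and the padding on each side for a given input size. `padding_plan` is an illustrative helper, and the 1500×2000 input is only an example size similar to those mentioned earlier.

```python
def padding_plan(height, width, target_size):
    """Return ((new_h, new_w), (top, bottom, left, right)) for an aspect-preserving resize."""
    target_h, target_w = target_size
    # The smaller ratio decides the scale, so the longer side exactly fills the target
    scale = min(target_h / height, target_w / width)
    new_h = min(target_h, max(1, round(height * scale)))
    new_w = min(target_w, max(1, round(width * scale)))
    pad_h, pad_w = target_h - new_h, target_w - new_w
    top, left = pad_h // 2, pad_w // 2
    return (new_h, new_w), (top, pad_h - top, left, pad_w - left)


# A landscape image 1500 pixels tall and 2000 pixels wide
print(padding_plan(1500, 2000, (224, 224)))  # → ((168, 224), (28, 28, 0, 0))
```

The width fills the full 224 pixels, the height shrinks to 168, and the remaining 56 rows are split evenly into black bars above and below.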

Pillar 6: Denoising & Artifact Handling — Cleaning the Window

Imagine looking through a window with dust and smudges on the glass. Cleaning the window makes the view clearer, but scrubbing too aggressively could scratch the glass.

Similarly, medical images often contain noise and acquisition artifacts that should be reduced carefully without removing clinically important details.

For chest X-rays, the most common issues are mild noise and burned-in text or markers. A gentle median or bilateral filter helps with the first, while cropping or masking helps with the second.

    # Gentle denoising: careful not to blur away clinical detail
    image_denoised = cv2.medianBlur(image, ksize=3)

    # Bilateral filter: smooths flat regions while preserving edges
    image_bilateral = cv2.bilateralFilter(image, d=5, sigmaColor=50, sigmaSpace=50)

⚠️ A note of caution: Aggressive denoising can erase the features a model needs to detect a disease. For diagnostic ML, gentle filtering is generally preferred. A useful rule of thumb: if a radiologist can no longer see a finding in the cleaned image that was visible in the original, the filtering has gone too far.

Takeaway: Imaging data carries noise that structured data doesn't have. The window can be cleaned, but never so aggressively that the view is wiped away with the smudges.
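One rough way to keep an eye on how much a filter has changed an image is to compare the result with the original numerically. The sketch below computes peak signal-to-noise ratio (PSNR) and the fraction of pixels that moved by more than a few gray levels. Both function names and the tolerance of 5 are assumptions, and neither number is a clinical measure; they only help compare filter settings against each other.

```python
import numpy as np


def psnr(original, filtered, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    a = original.astype(np.float64)
    b = filtered.astype(np.float64)
    if a.shape != b.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")  # the images are identical
    return float(10.0 * np.log10(max_value ** 2 / mse))


def changed_fraction(original, filtered, tolerance=5):
    """Fraction of pixels whose value moved by more than `tolerance` gray levels."""
    diff = np.abs(original.astype(np.int32) - filtered.astype(np.int32))
    return float(np.mean(diff > tolerance))
```

A higher PSNR and a lower changed fraction both indicate a gentler filter. Visual review by someone who knows the anatomy remains the real test.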

Putting it All Together: A Complete Pipeline

Workflow showing a chest X-ray progressing through a healthcare imaging preprocessing pipeline. The image moves through validation, resizing, denoising, contrast enhancement, scaling, and normalization before becoming a model-ready machine learning input.

Here's how the six pillars combine into a single preprocessing function for chest X-ray images:

    def preprocess_xray(image_path, target_size=(224, 224),
                        train_mean=0.482, train_std=0.236):
        """
        Full preprocessing pipeline for chest X-ray images.
        Applies all six pillars in order.
        """
        # Pillar 4: Validate first and skip corrupted files
        if not is_valid_image(image_path):
            return None

        image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

        # Pillar 5: Resize with aspect ratio preserved
        image = resize_with_padding(image, target_size)

        # Pillar 6: Gentle denoising
        image = cv2.medianBlur(image, 3)

        # Pillar 3: Enhance contrast to highlight lung texture
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        image = clahe.apply(image)

        # Pillar 1: Scale to [0, 1]
        image = image.astype(np.float32) / 255.0

        # Pillar 2: Normalize using training set statistics
        image = (image - train_mean) / train_std

        return image
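Because the pipeline returns None for files it cannot use, whatever loops over the dataset has to handle skipped files explicitly. The sketch below shows one way to collect results into arrays while keeping a list of what was skipped. `build_arrays` is an illustrative helper that accepts any preprocessing callable, and the trailing channel axis is an assumption about what the downstream model expects.

```python
import numpy as np


def build_arrays(items, preprocess_fn):
    """
    items: iterable of (image_path, label) pairs.
    preprocess_fn: callable returning a 2-D array, or None to skip the file.
    Returns (X, y, skipped), where X has shape (n, height, width, 1).
    """
    images, labels, skipped = [], [], []
    for path, label in items:
        result = preprocess_fn(path)
        if result is None:
            skipped.append(path)  # keep a record rather than failing silently
            continue
        images.append(result)
        labels.append(label)
    if not images:
        return np.empty((0,)), np.empty((0,), dtype=np.int64), skipped
    X = np.stack(images).astype(np.float32)[..., np.newaxis]
    y = np.asarray(labels, dtype=np.int64)
    return X, y, skipped
```

Passing the pipeline function as `preprocess_fn` keeps this loop independent of the specific preprocessing steps, so the same helper works if the pipeline changes.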

Try it Yourself

Every code snippet in this article is bundled into a runnable Kaggle notebook: Chest X-Ray Preprocessing — Kaggle Notebook. Fork it, attach the dataset, and run all the cells to see each preprocessing pillar in action on real chest X-rays.

Conclusion

Here's a summary of what we've discussed in this article:

| Pillar                | Purpose                         | Example               |
|-----------------------|---------------------------------|-----------------------|
| Scaling               | Standardize pixel ranges        | 0-255 → 0-1           |
| Normalization         | Center brightness distributions | z-score normalization |
| Attention Guidance    | Highlight diagnostic regions    | CLAHE                 |
| Missing Data Handling | Remove unusable scans           | Corrupted files       |
| Resizing              | Consistent input size           | 224×224               |
| Denoising             | Reduce acquisition noise        | Median filter         |

Preprocessing for structured data is about making numbers play fair so a model can see them clearly.

Preprocessing for healthcare imaging is about respecting the messy reality of how medical data is captured, stored, and labeled. Some standard techniques carry over directly. Some need to be adapted. And a few preprocessing concerns only emerge once the data becomes pictures of human bodies.

Stepping back, whether it's a child learning to organize their toy box, or a model learning to spot pneumonia in a chest X-ray, the quality of learning depends on the quality of data preparation. Get the data right.

If this was useful, you can find a related conceptual primer on preprocessing more broadly here: Data Preprocessing for Machine Learning.